
[fix] Revert change that removed the option to run OffloadModel without activation checkpointing. #608

Merged: 13 commits merged into master on Apr 15, 2021

Conversation

@anj-s anj-s (Contributor) commented Apr 14, 2021

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos or doc improvements)
  • Did you read the contributor guideline?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

  • The checkpoint_activation option was removed incorrectly. Reverted the SyncShard change, since it makes the code more readable in spite of adding another code path. I am working on a code refactor but wanted to get this change checked in in the meantime (a usage sketch of the restored flag follows after this list).
  • Added tests, since the existing ones did not detect the missing code path.
  • Modified the benchmarks to account for the checkpoint_activation flag. I will add a benchmark in an upcoming PR once I have a few more runs tabulated.
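
For reference, a minimal usage sketch of the restored option. It assumes the OffloadModel constructor from fairscale.experimental.nn.offload; the device choices and num_slices value are illustrative, and only the checkpoint_activation flag is the point.

# Minimal sketch (assumptions noted above): the same wrapped model, with and
# without activation checkpointing.
import torch
from fairscale.experimental.nn.offload import OffloadModel

model = torch.nn.Sequential(
    torch.nn.Linear(32, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 32),
)

# Path restored by this PR: offloading without activation checkpointing.
offload_model = OffloadModel(
    model=model,
    device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    num_slices=3,
    checkpoint_activation=False,
)

# Existing path: the forward pass is wrapped in activation checkpointing.
offload_model_ckpt = OffloadModel(
    model=model,
    device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    num_slices=3,
    checkpoint_activation=True,
)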

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 14, 2021
@anj-s anj-s requested a review from blefaudeux April 14, 2021 20:32
@@ -292,6 +292,75 @@ def backward(ctx, *grad_outputs): # type: ignore
return (None, None) + grads


class ShardSyncLayer(torch.autograd.Function):
Contributor:

at least this part I'm a bit familiar with :)

@@ -386,4 +455,23 @@ def forward(self, *inputs: Any, **_: Any) -> Any:

# We need the second param to be a dummy input to enable the
# backward pass to be triggered for integer inputs.
return ActivationCheckpointing.apply(*inputs, torch.tensor([], requires_grad=True), self)
if self._checkpoint_activation:
Contributor:

oh, I must have reviewed the offending PR and missed that, sorry about that

Contributor (author):

No worries! I realized that the tests weren't really catching this, so I'm glad I caught it.
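
For readers following along, the hunk above restores a two-way branch in forward: the checkpointed path wraps the whole pass in ActivationCheckpointing, while the non-checkpointed path walks the model shards with ShardSyncLayer marking the boundaries. The sketch below is illustrative only; the shard loop and the ShardSyncLayer.apply signature are assumptions, not the fairscale implementation.

# Illustrative sketch, not the fairscale source. self.model_slices and the
# ShardSyncLayer.apply signature are assumed for the example.
def forward(self, *inputs: Any, **_: Any) -> Any:
    if self._checkpoint_activation:
        # The empty dummy tensor ensures the backward pass is triggered even
        # when all real inputs are integer tensors.
        return ActivationCheckpointing.apply(
            *inputs, torch.tensor([], requires_grad=True), self
        )

    # Non-checkpointed path: run shard by shard; ShardSyncLayer is an
    # autograd.Function that marks shard boundaries so the backward pass can
    # move the right shard back onto the compute device.
    activations = inputs
    for index, shard in enumerate(self.model_slices):
        activations = ShardSyncLayer.apply(activations, index, self.model_slices, self)
        activations = (shard(*activations),)
    return activations[0]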

offload_model.train()
pred = offload_model(input)
loss_fn = torch.nn.MSELoss(reduction="sum")
loss = loss_fn(pred, labels)
Contributor:

Checking elsewhere for some form of parity? Wondering just in case.
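
As a side note on the parity question above, one common shape for such a check (not from this PR; the tolerance and function name are illustrative) is to compare the offloaded model's loss against the plain module it was built from:

# Illustrative parity check. Assumes both models were constructed from the
# same weights and that inputs/labels already live on devices both can accept
# (e.g. a CPU-only run).
import torch

def check_parity(reference_model, offload_model, inputs, labels, atol=1e-5):
    loss_fn = torch.nn.MSELoss(reduction="sum")
    with torch.no_grad():
        reference_loss = loss_fn(reference_model(inputs), labels)
        offload_loss = loss_fn(offload_model(inputs), labels)
    # Compare on CPU so it does not matter which device each model ran on.
    assert torch.allclose(reference_loss.cpu(), offload_loss.cpu(), atol=atol), (
        f"loss mismatch: {reference_loss.item()} vs {offload_loss.item()}"
    )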

@blefaudeux blefaudeux (Contributor) left a comment:

LGTM, especially since it's a revert; otherwise I'm missing some bits of the big picture, checking 1:1.

@anj-s anj-s merged commit a77c56f into master Apr 15, 2021
@anj-s anj-s deleted the revert-sync-shard branch April 15, 2021 02:50
@min-xu-ai min-xu-ai mentioned this pull request Apr 19, 2021